Smaller Proximate Search Index

ABSTRACT

A data management system accesses a set of vectors containing binary values and generates vector blocks comprising binary values from each vector. Each of at least a portion of the vector blocks for each vector contain a set of two or more binary values from the vector. The data management system generates a block index based on the vector blocks. The block index includes a set of vector block arrays, each vector block array corresponding to a position in the vectors and including the binary values of a vector block from each vector. The data management system can identify relevant vectors for a target vector by generating vector blocks from the target vector and querying the block index to identify candidate vectors.

CROSS-REFERENCE TO RELATED APPLICATIONS

This patent application is a continuation of U.S. patent application Ser. No. 15/691,610, filed Aug. 30, 2017, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure generally relates to the technical field of special-purpose machines that facilitate proximate search, including computerized variants of such special-purpose machines and improvements to such variants, and to the technologies by which such special-purpose machines become improved compared to other special-purpose machines that facilitate proximate search. In particular, the present disclosure addresses systems and methods for smaller proximate search.

BACKGROUND

Computing devices typically use advanced algorithms to represent data objects (e.g., images, audio files, text documents, etc.) as vectors. These vectors include multiple dimensions that each represent a feature of the data object. One use for these vectors is identifying matching or similar data objects. For example, distance functions are used to identify vectors that are closest to a target vector representing a target data object (e.g., k-nearest neighbors). The nearby vectors indicate that the corresponding data objects either match or are similar to the target data object. While effective, these methods are resource intensive for large data sets.

Current improvements include converting floating values in the vectors to binary values, thereby reducing the size and complexity of the vectors. A hamming distance between the converted vectors is determined to identify similar vectors. The hamming distance indicates the number of positions that differ between two binary strings. A subset of vectors that have a hamming distance below a threshold are identified as candidate vectors. The system then uses distance functions on this smaller subset of candidate vectors, thereby reducing resource usage.

While these methods represent improvements, an additional reduction in resource usage is desirable. Accordingly, additional technical improvements are needed.

BRIEF DESCRIPTION OF THE DRAWINGS

Various ones of the appended drawings merely illustrate example embodiments of the present disclosure and are not intended to limit its scope to the illustrated embodiments. On the contrary, these examples are intended to cover alternatives, modifications, and equivalents as may be included within the scope of the disclosure.

FIG. 1 shows an exemplary system for providing smaller proximate search, according to some example embodiments.

FIG. 2 shows vectors representing data objects, according to some example embodiments.

FIG. 3 shows vectors representing data objects that have been converted to binary values, according to some example embodiments.

FIGS. 4A-4F show generating vector blocks and a block index, according to some example embodiments.

FIG. 5 shows an example block diagram of the data manager, according to some example embodiments.

FIG. 6 shows an example method for smaller proximate search, according to some example embodiments.

FIG. 7 shows a block diagram illustrating components of a computing device, according to some example embodiments, able to read instructions from a machine-readable medium (e.g., a machine-readable storage medium) and perform any one or more of the methodologies discussed herein.

DETAILED DESCRIPTION

Reference will now be made in detail to specific example embodiments for carrying out the inventive subject matter of the present disclosure. In the following description, specific details are set forth in order to provide a thorough understanding of the subject matter. It shall be appreciated that embodiments may be practiced without some or all of these specific details.

Disclosed are systems, methods, and computer-readable storage media for smaller proximate search. A data management system uses advanced algorithms to represent data objects (e.g., images, audio files, text documents, etc.) as vectors. These vectors include multiple dimensions that each represent a feature of the data object. One use for these vectors is identifying matching or similar data objects. For example, the data management system uses distance functions to determine the nearest reference vectors to a target vector. The nearest reference vectors represent data objects that are close to the target vector representing a target data object. Nearby vectors indicate that the corresponding data objects match or are similar to the target data object. While effective, these methods are resource intensive for large data sets.

To reduce resource usage, the data management system converts floating values in the vectors to binary values, thereby reducing the size and complexity of the vectors. For example, the data management system assigns a value of 1 to each floating value that meets or exceeds a threshold value, and a value of 0 to each floating value below the threshold value. The resulting binary strings represent the vectors in simpler terms.

The data management system then determines hamming distances between the converted vectors (i.e., binary strings) and a converted target vector. The hamming distance indicates the number of positions that differ between two binary strings. The data management system identifies a subset of vectors that have a hamming distance below a threshold from the target vector, yielding a set of candidate vectors. The data management system then uses distance functions on this smaller subset of candidate vectors, thereby reducing resource usage.

To increase efficiency in calculating the hamming distances, the data management system generates a set of sequentially ordered vector blocks for each of the vectors. Each vector block in a set of sequentially ordered vector blocks contains a set of two or more binary values that are ordered sequentially in the vector, as well as a numerical vector identifier identifying the vector.

The data management system generates a block index based on the sets of sequentially ordered vector blocks. The block index includes a set of vector block arrays, where each individual vector block array corresponds to one sequential position. Each of the vector block arrays includes one vector block from each set of sequentially ordered vector blocks that correspond to the same sequential position. As an example, the first vector block array includes vector blocks from the first sequential position (i.e., including the first and second sequentially ordered binary values in the vector). The second vector block array includes the vector blocks from the second sequential position (i.e., including the third and fourth sequentially ordered binary values in the vector) and so forth. The vector blocks in each vector block array are ordered sequentially based on their binary values, such as from lowest to highest or vice versa.

The data management system reduces the size of the block index by combining sequential vector blocks in each vector block array that include the same binary values. The combined vector blocks include the binary values and the vector identifiers for the combined vector blocks. The data management system orders the vector identifiers sequentially from lowest to highest.

To further reduce the size of the block index, the data management system uses differential encoding to encode the values of the vector identifiers in each combined vector block as well as the binary values of each vector block.

FIG. 1 shows an exemplary system 100 for providing smaller proximate search, according to some example embodiments. While the system 100 shown employs a client-server architecture, the present inventive subject matter is, of course, not limited to such an architecture, and could equally well find application in an event-driven, distributed, or peer-to-peer architecture system, for example. Moreover, it shall be appreciated that although the various functional components of the system 100 are discussed in a singular sense, multiple instances of one or more of the various functional components may be employed.

As shown, the system 100 can include multiple computing devices connected to a communication network 102 and configured to communicate with each other through use of the communication network 102. The communication network 102 can be any type of network, including a local area network (“LAN”), such as an intranet, a wide area network (“WAN”), such as the internet, or any combination thereof. Further, the communication network 102 can be a public network, a private network, or a combination thereof. The communication network 102 can also be implemented using any number of communication links associated with one or more service providers, including one or more wired communication links, one or more wireless communication links, or any combination thereof. Additionally, the communication network 102 can be configured to support the transmission of data formatted using any number of protocols.

Multiple computing devices can be connected to the communication network 102. A computing device can be any type of general computing device capable of network communication with other computing devices. For example, a computing device can be a personal computing device such as a desktop or workstation, a business server, or a portable computing device, such as a laptop, smart phone, or a tablet PC. A computing device can include some or all of the features, components, and peripherals of computing device 700 of FIG. 7.

To facilitate communication with other computing devices, a computing device can include a communication interface configured to receive a communication, such as a request, data, etc., from another computing device in network communication with the computing device and pass the communication along to an appropriate module running on the computing device. The communication interface can also be configured to send a communication to another computing device in network communication with the computing device.

As shown, the system 100 includes a client device 104 and data management system 106. In the system 100, a user can interact with the data management system 106 through the client device 104 connected to the communication network 102 by direct and/or indirect communication. The client device 104 can be any of a variety of types of computing devices that include at least a display, a computer processor, and communication capabilities that provide access to the communication network 102 (e.g., a smart phone, a tablet computer, a personal digital assistant (PDA), a personal navigation device (PND), a handheld computer, a desktop computer, a laptop or netbook, or a wearable computing device).

The data management system 106 can consist of one or more computing devices and support connections from a variety of different types of client devices 104, such as desktop computers; mobile computers; mobile communications devices (e.g., mobile phones, smart phones, tablets, etc.); smart televisions; set-top boxes; and/or any other network enabled computing devices. The client device 104 can be of varying type, capabilities, operating systems, etc. Furthermore, the data management system 106 can concurrently accept connections from and interact with multiple client devices 104.

A user can interact with the data management system 106 via a client-side application 108 installed on the client device 104. In some embodiments, the client-side application 108 can include a data management system 106 specific component. For example, the component can be a stand-alone application, one or more application plug-ins, and/or a browser extension. However, the user can also interact with the data management system 106 via a third-party application 110, such as a web browser, that resides on the client device 104 and is configured to communicate with the data management system 106. In either case, the client-side application 108 and/or the third-party application 110 can present a user interface (UI) for the user to interact with the data management system 106.

The data management system 106 can include a data storage 112 to store data. The stored data can include any type of data, such as digital data, documents, text files, audio files, video files, etc. The data storage 112 can be a storage device, multiple storage devices, or one or more servers. Alternatively, the data storage 112 can be a cloud storage provider or network storage. The data management system 106 can store data in a storage area network (SAN) device, in a redundant array of inexpensive disks (RAID), etc. The data storage 112 can store data items using one or more partition types, such as FAT, FAT32, NTFS, EXT2, EXT3, EXT4, ReiserFS, BTRFS, and so forth.

The data management system 106 includes a data manager 114 configured to identify similar data objects. A data object is any type of data such as an image file, audio file, text file, etc. The data manager 114 receives a target data object and searches for matching or similar data objects. For example, the data manager 114 provides a user interface that enables a user to provide a target data object, such as an image, audio file, etc., and returns a listing of similar data objects.

To identify similar data objects, the data manager 114 converts a target data object into a vector representing the data object. The vector includes multiple dimensions that each represent a feature of the data object. The data manager 114 uses distance functions to identify vectors that are close to the target vector. Vectors that are near the target vector represent data objects that are similar to the target data object. For example, the data storage 112 stores data objects and their corresponding vectors. The data manager 114 returns any similar data objects in response to a request.

The data manager 114 is configured to implement multiple techniques to reduce the resources required to identify similar data objects. These techniques are described in detail with reference to the following figures.

FIG. 2 shows vectors representing data objects, according to some example embodiments. As shown, each vector (V1, V2, V3, V4 . . . Vn) includes multiple dimension values, each of which represents a feature of the data object represented by the vector. The dimension values are floating values. For example, vector V1 includes the dimension values 0.1, 1.2, 3.1, etc. As another example, the vector V2 includes the dimension values 0.4, 1.6, 3.2, etc. The dimension values in the same sequential position within each vector correspond to the same feature of the data object represented by the vector. For example, the vector V1 includes the value 0.1 for the first feature, whereas vector V2 includes the value 0.4 for the same first feature. These differences in values indicate differences in the data objects represented by each vector.

The size of the vectors is relatively large because each dimension value is a floating value requiring 4 bytes of space. As an example, a large data set with 1 billion items, each with 1024 dimensions, would require approximately 4 terabytes of data. To reduce the amount of space, the data manager 114 converts the floating values to binary values. For example, the data manager 114 assigns each floating value below a threshold value as a 0 and each floating value that meets or exceeds the threshold value as a 1. As a result of converting the floating values to binary values, only 1 bit is required to store each dimension of the vector, which reduces the space needed considerably. For example, a large data set with 1 billion items, each with 1024 dimensions, would require approximately 128 gigabytes of data, compared to 4 terabytes needed prior to conversion.
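As a non-limiting illustration, the storage figures above can be reproduced with the following arithmetic sketch. The constant names are chosen here for illustration only, and decimal units (1 terabyte = 10^12 bytes, 1 gigabyte = 10^9 bytes) are assumed.

```python
# Back-of-the-envelope storage estimate using the figures from the text.
NUM_ITEMS = 1_000_000_000   # 1 billion vectors
NUM_DIMENSIONS = 1024       # dimensions per vector
BYTES_PER_FLOAT = 4         # one 4-byte floating value per dimension

# Storage with floating values: 4 bytes per dimension.
float_bytes = NUM_ITEMS * NUM_DIMENSIONS * BYTES_PER_FLOAT

# Storage with binary values: 1 bit per dimension, 8 bits per byte.
binary_bytes = NUM_ITEMS * NUM_DIMENSIONS // 8

print(f"floating values: {float_bytes / 1e12:.3f} TB")
print(f"binary values:   {binary_bytes / 1e9:.1f} GB")
print(f"reduction:       {float_bytes // binary_bytes}x")
```

The 32-fold reduction follows directly from replacing a 32-bit floating value with a single bit per dimension.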

FIG. 3 shows vectors representing data objects that have been converted to binary values, according to some example embodiments. As shown, each dimension value in the vectors is either a 0 or a 1. The data manager 114 assigns a binary value to each dimension based on a threshold number. For example, the data manager 114 compares the floating values assigned to each dimension of a vector to the threshold number and determines whether the floating value is below the threshold or meets/exceeds the threshold value. If the floating value is below the threshold, the data manager 114 assigns a binary value of 0 to the dimension in the vector. Alternatively, if the floating value meets or exceeds the threshold value, the data manager 114 assigns a binary value of 1 to the dimension in the vector.

The data manager 114 determines the threshold value using any known technique in the art. For example, the threshold value may be a mean value, average value, etc., of the floating values assigned to the dimensions in the vector.
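By way of example only, the conversion of floating values to binary values may be sketched as follows. The function name `binarize` and the sample values are hypothetical; the use of the per-vector mean as a default threshold reflects just one of the options mentioned above.

```python
from statistics import mean


def binarize(vector, threshold=None):
    """Convert floating values to binary values.

    Values that meet or exceed the threshold become 1; values below it
    become 0. When no threshold is supplied, the mean of the vector's own
    values is used (an assumption made for this sketch).
    """
    if threshold is None:
        threshold = mean(vector)
    return [1 if value >= threshold else 0 for value in vector]


# Hypothetical floating values; their mean is 1.75.
floating_values = [0.5, 2.0, 3.5, 1.0]
binary_values = binarize(floating_values)
print(binary_values)
```

A caller may also pass an explicit threshold, for example `binarize(floating_values, threshold=1.0)`, when a fixed cut-off is preferred over a per-vector statistic.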

The converted vectors are representations of the original vectors; however, they are not exact. Accordingly, the data manager 114 uses the converted vectors to identify a subset of the vectors as candidate vectors that are similar to a target vector. The data manager 114 then performs the full distance functions on the smaller subset of candidate vectors utilizing the original floating values assigned to the candidate vectors and the target vector. Performing the distance functions on a smaller subset of candidate vectors greatly reduces the resources required to identify nearby vectors as compared to performing the distance functions on all of the vectors.
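The second stage described above, in which a full distance function is applied only to the candidate vectors, may be sketched as follows. The disclosure does not name a particular distance function; Euclidean distance is assumed here purely for illustration, and the function name `rerank_candidates` and the sample data are hypothetical.

```python
import math


def rerank_candidates(target, candidates, k):
    """Return the identifiers of the k candidates nearest to the target.

    `candidates` maps a vector identifier to its original floating values.
    Euclidean distance is an assumption of this sketch.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: math.dist(target, item[1]),
    )
    return [vector_id for vector_id, _ in ranked[:k]]


# Hypothetical two-dimensional vectors for a small demonstration.
target_vector = [0.0, 0.0]
candidate_vectors = {1: [3.0, 4.0], 2: [1.0, 0.0], 3: [0.0, 2.0]}
nearest = rerank_candidates(target_vector, candidate_vectors, k=2)
print(nearest)
```

Because the sort runs only over the candidate subset, its cost scales with the number of candidates rather than with the size of the full data set.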

To identify the candidate vectors, the data manager 114 determines a hamming distance between a converted target vector and the other converted vectors. The hamming distance indicates the number of positions that differ between two binary strings. For example, to determine the hamming distance between the vector V1 and the vector V2 shown in FIG. 3, the data manager 114 compares the binary values at each position to identify the number of positions in which the two binary values do not match. As shown, only the first and third binary values of vector V1 and vector V2 do not match. Hence, the data manager 114 determines that the hamming distance between vector V1 and vector V2 is 2. As another example, the first, fifth and sixth values of vectors V1 and V3 do not match. Accordingly, the data manager 114 determines that the hamming distance between vector V1 and vector V3 is 3.
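The hamming distance computation may be illustrated with the following sketch. The function name is hypothetical, and the eight-value vectors are illustrative stand-ins chosen to be consistent with the mismatching positions described above; the complete vectors of FIG. 3 are not reproduced in the text.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x != y for x, y in zip(a, b))


# Illustrative vectors (assumed values, consistent with the text's examples).
V1 = [0, 1, 0, 0, 1, 0, 0, 1]
V2 = [1, 1, 1, 0, 1, 0, 0, 1]  # differs from V1 at the first and third positions
V3 = [1, 1, 0, 0, 0, 1, 0, 1]  # differs from V1 at the first, fifth, and sixth positions

print(hamming_distance(V1, V2))
print(hamming_distance(V1, V3))
```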

The data manager 114 uses a hamming distance threshold to identify the subset of vectors that are candidate vectors. The hamming distance threshold indicates a maximum hamming distance value for vectors to be included as candidates. For example, a hamming distance threshold of 2 would result in the data manager 114 identifying all vectors that have a hamming distance of 2 or less as being candidate vectors. Any vectors with a hamming distance value above 2 (i.e., 3 or more) would not be included as a candidate vector. As another example, a hamming distance threshold of 1 would result in the data manager 114 identifying all vectors that have a hamming distance of 1 or less as being candidate vectors. Any vectors with a hamming distance value above 1 (i.e., 2 or more) would not be included as a candidate vector.
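A minimal sketch of threshold-based candidate selection follows. It scans every vector directly, without the block index, to show only the thresholding rule. The function names, the integer identifiers, and the vector values are assumptions made for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def find_candidates(target, vectors, max_distance):
    """Return identifiers of vectors within max_distance of the target.

    A vector qualifies when its hamming distance is less than or equal to
    the threshold, matching the "2 or less" rule described in the text.
    """
    return sorted(
        vector_id
        for vector_id, bits in vectors.items()
        if hamming_distance(target, bits) <= max_distance
    )


# Illustrative data; identifiers 1-4 stand in for V1-V4 (assumed values).
TARGET = [0, 0, 0, 1, 0, 0, 1, 1]
VECTORS = {
    1: [0, 1, 0, 0, 1, 0, 0, 1],
    2: [1, 1, 1, 0, 1, 0, 0, 1],
    3: [1, 1, 0, 0, 0, 1, 0, 1],
    4: [0, 0, 0, 1, 0, 1, 1, 1],
}

print(find_candidates(TARGET, VECTORS, max_distance=2))
print(find_candidates(TARGET, VECTORS, max_distance=4))
```

Raising the threshold admits more candidates, which trades additional full-distance work for a lower chance of excluding a relevant vector.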

To further reduce resource usage in determining the hamming distance, the data manager 114 generates a set of sequentially ordered vector blocks for each vector and then uses the sets of sequentially ordered vector blocks to generate a block index. Each vector block in a set of sequentially ordered vector blocks contains a set of two or more binary values that are ordered sequentially in the vector, as well as a numerical vector identifier identifying the vector.

The block index includes a set of vector block arrays, where each individual vector block array corresponds to one sequential position. Each of the vector block arrays includes one vector block from each set of sequentially ordered vector blocks that correspond to the same sequential position. As an example, the first vector block array includes vector blocks from the first sequential position (i.e., including the first and second sequentially ordered binary values in the vector). The second vector block array includes the vector blocks from the second sequential position (i.e., including the third and fourth sequentially ordered binary values in the vector) and so forth. The vector blocks in each vector block array are ordered sequentially based on their binary values, such as from lowest to highest or vice versa.

In addition to generating the vector blocks, the data manager 114 encodes the data included in the vector blocks to further reduce their size. The data manager 114 uses the block index to quickly identify vectors that have a hamming distance below a threshold distance. This is discussed in greater detail with respect to FIGS. 4A-4F.

FIGS. 4A-4F show generating vector blocks and a block index, according to some example embodiments. FIG. 4A shows a set of vector blocks generated for a vector V1. As shown, the vector V1 is broken into a sequential set of vector blocks based on the sequential order of the binary values included in the vector V1. Each vector block corresponds to a position of the vector V1 and includes two sequentially ordered binary values from the position of the vector V1, as well as a vector identifier that identifies vector V1. For example, the vector block B1 corresponds to the first position of the vector V1 and includes the first two binary values of the vector V1 (i.e., 0, 1). Similarly, the vector block B2 corresponds to the second position of the vector V1 and includes the third and fourth binary values of the vector V1 (i.e., 0, 0). Each of the vector blocks includes the vector identifier V1 because each vector block corresponds to the vector V1. Although the shown example includes two sequentially ordered binary values in each vector block, this is just one example, and is not meant to be limiting. The vector blocks can be generated to include any number of two or more sequentially ordered binary values, such as 3, 4, 5, etc., although the number of sequentially ordered binary values in each vector block should be the same.
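The generation of vector blocks may be sketched as follows. The function name, the tuple layout `(binary values, vector identifier)`, and the last four binary values of the sample vector are assumptions of this sketch; only the first four values (0, 1, 0, 0) are taken from the description of FIG. 4A.

```python
def make_vector_blocks(vector_id, bits, block_size=2):
    """Split a binary vector into sequentially ordered vector blocks.

    Each block is a (binary values, vector identifier) pair. Every block
    holds the same number of values, so the vector length must be a
    multiple of block_size.
    """
    if block_size < 2:
        raise ValueError("each vector block holds two or more binary values")
    if len(bits) % block_size != 0:
        raise ValueError("vector length must be a multiple of block_size")
    return [
        (tuple(bits[start:start + block_size]), vector_id)
        for start in range(0, len(bits), block_size)
    ]


# First four values follow FIG. 4A as described; the rest are illustrative.
V1 = [0, 1, 0, 0, 1, 0, 0, 1]
blocks = make_vector_blocks(1, V1)
for position, block in enumerate(blocks, start=1):
    print(f"B{position}: {block}")
```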

The data manager 114 generates a set of sequential vector blocks for each of the vectors. The data manager 114 then uses the vector blocks to generate a block index. The block index includes a set of block arrays. Each block array in the block index corresponds to one position in the vectors and includes a vector block from each vector that corresponds to the position. Further, the vector blocks in each block array are ordered sequentially based on the binary values included in the respective vector blocks. For example, the vector blocks are ordered from lowest to highest, or vice versa.

FIG. 4B shows a block index. As shown, the block index includes four block arrays, each corresponding to a different position in the vectors. For example, the first block array corresponds to the first position in the vectors (i.e., B1) and includes a vector block from each vector corresponding to position B1. The second block array corresponds to the second position in the vectors (i.e., B2) and includes a vector block from each vector corresponding to position B2. The third block array corresponds to the third position in the vectors (i.e., B3) and includes a vector block from each vector corresponding to position B3. Finally, the fourth block array corresponds to the fourth position in the vectors (i.e., B4) and includes a vector block from each vector corresponding to position B4.

As shown, the vector blocks in each block array are ordered sequentially based on the binary values included in the vector blocks. For example, the first block array starts with the vector block from the vector V4 that includes the binary value 0, 0, then moves to the vector block from the vector V1 that includes the binary value 0, 1, and finally includes the vector blocks from the vectors V2 and V3 that both include the binary value 1, 1. Vector blocks that include the same binary values are ordered sequentially according to the vector identifier from lowest to highest.
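One way to build such a block index is sketched below. The function name and data layout are hypothetical. The first four binary values of each sample vector are chosen so that the first two block arrays match the ordering described for FIGS. 4B-4D; the remaining values are illustrative.

```python
def build_block_index(vectors, block_size=2):
    """Build one sorted block array per block position.

    `vectors` maps a numerical vector identifier to its binary values.
    Each block array holds (binary values, vector identifier) pairs sorted
    by binary values, with ties broken by vector identifier.
    """
    lengths = {len(bits) for bits in vectors.values()}
    if len(lengths) != 1:
        raise ValueError("all vectors must have the same number of dimensions")
    num_positions = lengths.pop() // block_size

    index = [[] for _ in range(num_positions)]
    for vector_id, bits in vectors.items():
        for position in range(num_positions):
            start = position * block_size
            block_bits = tuple(bits[start:start + block_size])
            index[position].append((block_bits, vector_id))

    # Tuple comparison orders by binary values first, then by identifier.
    for block_array in index:
        block_array.sort()
    return index


VECTORS = {
    1: [0, 1, 0, 0, 1, 0, 0, 1],
    2: [1, 1, 1, 0, 1, 0, 0, 1],
    3: [1, 1, 0, 0, 0, 1, 0, 1],
    4: [0, 0, 0, 1, 0, 1, 1, 1],
}
block_index = build_block_index(VECTORS)
for position, block_array in enumerate(block_index, start=1):
    print(f"B{position}: {block_array}")
```

Keeping each block array sorted is what later allows a binary search per position instead of a scan over every vector.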

To determine the hamming distance between a target vector and the vectors included in the block index, the data manager 114 performs a binary search of each block array based on the corresponding binary values of the target vector. For example, given the shown target vector of 0, 0, 0, 1, 0, 0, 1, 1, the data manager 114 would perform a binary search of each block array based on the binary values of the target vector that are in the corresponding positions. For example, the data manager 114 performs a binary search of the first block array that corresponds to position B1 based on the binary values in the target vector that are in position B1 (i.e., 0, 0). Likewise, the data manager 114 performs a binary search of the second block array that corresponds to position B2 based on the binary values in the target vector that are in position B2 (i.e., 0, 1).

The data manager 114 performs the binary search to identify matching vector blocks and adds the vectors identified by the matching vector blocks to the candidate list. For example, a binary search of the first block array would result in a match of the vector V4 that includes binary values 0, 0. The data manager 114 then adds the vector V4 to the candidate list. Likewise, a binary search of the second block array would also result in a match of the vector V4 that includes the binary values 0, 1. The data manager 114 repeats this process for each block array and adds and removes vectors from the candidate list according to desired implementations. For example, the vectors are added based on the desired threshold hamming distance. Once the data manager 114 has identified the set of candidate vectors, the data manager 114 then performs a full distance search of the candidate vectors to identify the vectors that are nearest to the target vector. Performing the full distance search on the smaller set of candidate vectors rather than the full set of vectors greatly reduces resource usage and latency in identifying the nearest vectors.
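The querying of the block index may be sketched as follows. The text leaves the rule for adding and removing candidates to the desired implementation; this sketch assumes one simple possibility, in which a vector becomes a candidate when at least `min_matches` of its blocks exactly match the corresponding blocks of the target vector. The function name, the `min_matches` parameter, and the contents of the third and fourth block arrays are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def query_block_index(index, target_bits, block_size=2, min_matches=1):
    """Return identifiers of vectors with enough exactly matching blocks.

    Each block array is a sorted list of (binary values, vector identifier)
    pairs. The min_matches rule is an assumption of this sketch.
    """
    matches = Counter()
    for position, block_array in enumerate(index):
        start = position * block_size
        wanted = tuple(target_bits[start:start + block_size])
        # The one-element tuple (wanted,) sorts before every
        # (wanted, vector_id) pair, so bisect_left lands on the first match.
        i = bisect_left(block_array, (wanted,))
        while i < len(block_array) and block_array[i][0] == wanted:
            matches[block_array[i][1]] += 1
            i += 1
    return sorted(vid for vid, count in matches.items() if count >= min_matches)


# First two block arrays follow the text; the last two are illustrative.
BLOCK_INDEX = [
    [((0, 0), 4), ((0, 1), 1), ((1, 1), 2), ((1, 1), 3)],
    [((0, 0), 1), ((0, 0), 3), ((0, 1), 4), ((1, 0), 2)],
    [((0, 1), 3), ((0, 1), 4), ((1, 0), 1), ((1, 0), 2)],
    [((0, 1), 1), ((0, 1), 2), ((0, 1), 3), ((1, 1), 4)],
]
TARGET = [0, 0, 0, 1, 0, 0, 1, 1]
print(query_block_index(BLOCK_INDEX, TARGET))
```

With the target vector from the text, the first and second block arrays both yield vector V4 (identifier 4), in line with the example above. A stricter `min_matches` value narrows the candidate list at the cost of possibly excluding vectors whose differing bits are spread across many blocks.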

The data manager 114 further reduces resource usage by reducing the size of the block index. For example, the data manager 114 consolidates vector blocks that include the same binary value and encodes the binary values and vector identifiers.

As shown in FIG. 4C, the first block array includes two vector blocks that include the binary values 1, 1. Further, the second block array includes two vector blocks that include the binary values 0, 0. To reduce the amount of space used, the data manager 114 can combine these vector blocks into a single vector block. The combined vector block includes the binary values as well as the vector identifiers for each of the combined vector blocks.

FIG. 4D shows combined vector blocks. As shown, the two vector blocks in the first block array have been combined into a single vector block. The single vector block includes one set of the binary values 1, 1, as well as the vector identifiers for the combined vector block (i.e., V2, V3). Likewise, the two vector blocks in the second block array have been combined into a single vector block. The single vector block includes one set of the binary values 0, 0, as well as the vector identifiers for the combined vector block (i.e., V1, V3). Combining vector blocks with the same binary values reduces the overall space utilized for the vector blocks. This reduction in size amounts to considerable resource savings when dealing with large data sets.
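Combining vector blocks that share the same binary values may be sketched as follows, using the two block arrays described above. The function name and the `(binary values, [vector identifiers])` layout are assumptions of this sketch.

```python
from itertools import groupby


def combine_blocks(block_array):
    """Merge adjacent vector blocks that hold the same binary values.

    The input must already be sorted by binary values and then by vector
    identifier, so equal blocks are adjacent and each identifier list comes
    out in ascending order.
    """
    return [
        (bits, [vector_id for _, vector_id in group])
        for bits, group in groupby(block_array, key=lambda block: block[0])
    ]


first_block_array = [((0, 0), 4), ((0, 1), 1), ((1, 1), 2), ((1, 1), 3)]
second_block_array = [((0, 0), 1), ((0, 0), 3), ((0, 1), 4), ((1, 0), 2)]

combined_first = combine_blocks(first_block_array)
combined_second = combine_blocks(second_block_array)
print(combined_first)
print(combined_second)
```

Each distinct set of binary values is now stored once per block array, however many vectors share it.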

To further reduce the size of the combined vector blocks, the data manager 114 encodes the vector identifiers. Encoding the vector identifiers represents the vector identifiers based on the change from the previous vector identifier. For example, to encode the sequence 2, 7, 10, the data manager 114 represents the first number as its original value (i.e., 2) and then represents the other two numbers based on the change in value from the previous number. For example, the second number 7 is 5 greater than the previous number 2 and the third number 10 is 3 greater than the previous number 7. Hence, the encoded sequence to represent 2, 7, 10 would be 2, 5, 3. Encoding the numerical values in this manner reduces the overall size needed to store these values, thereby reducing the overall size of the vector blocks.
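The differential encoding of vector identifiers, together with the corresponding decoding step, may be sketched as follows. The function names are hypothetical, and the identifiers are assumed to be sorted from lowest to highest.

```python
from itertools import accumulate


def delta_encode(ids):
    """Keep the first identifier; replace each later one with its increase."""
    ids = list(ids)
    return [current - previous for previous, current in zip([0] + ids, ids)]


def delta_decode(deltas):
    """Recover the original identifiers with a running sum."""
    return list(accumulate(deltas))


encoded = delta_encode([2, 7, 10])
decoded = delta_decode(encoded)
print(encoded)
print(decoded)
```

The space saving arises because the differences between sorted identifiers are typically much smaller than the identifiers themselves and can therefore be stored in fewer bits, for example with a variable-length integer format (an implementation choice not specified in the text).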

FIG. 4E shows the block index with encoded vector identifiers. As shown, the vector identifiers for the combined vector blocks in the first block array have been encoded to the values 2, 1. This represents that the first vector identifier is for the vector V2 and the second vector identifier is one more than the previous value (i.e., the vector V3). Likewise, the vector identifiers for the combined vector blocks in the second block array have been encoded to the values 1, 2. This represents that the first vector identifier is for the vector V1 and the second vector identifier is two more than the previous value (i.e., the vector V3). Encoding the vector identifiers in the combined vector blocks reduces the space required to store each vector identifier, which represents a substantial reduction in data in large data sets.

To further reduce the size of the block index, the data manager 114 encodes the binary values of the vector blocks in each block array. The vector blocks in each block array are ordered sequentially based on the binary values included in the vector blocks from lowest to highest. Rather than represent each set of binary values individually, the data manager 114 represents the binary values based on the difference between the binary value and the previous binary value. For example, given the sequence 00, 00, 01, 10 and 11, the data manager 114 encodes the values based on the difference between the values and the previous value. The first value remains as 00, the second value is assigned 0 because there is no change from the previous value, the third value is assigned as 1 because it represents an increase of 1 from the previous value, the fourth value is assigned as 1 because it represents an increase of 1 from the previous value, and the fifth value is assigned as 1 because it represents an increase of 1 from the previous value. As a result, the binary sequence 00, 00, 01, 10, 11 is encoded as 00, 0, 1, 1, 1. Encoding the binary values further reduces the size of the individual vector blocks, which represents a substantial reduction in size in large data sets.

FIG. 4F shows the block index with encoded binary values. As shown, the first block array that previously had a sequence of 00, 01, 11 has been encoded to 00, 1, 2. The encoded values are based on the difference between the value and the previous value. For example, the difference between 00 and 01 is 1, so the value is encoded as 1 to represent the difference. Similarly, the difference between 01 and 11 is 2, so the value is encoded as 2 to represent the difference.

The second block array has also been encoded. The second block array previously had a sequence of 00, 01, 10 and has been encoded as 00, 1, 1. The encoded values are based on the difference between the value and the previous value. For example, the difference between 00 and 01 is 1, so the value is encoded as 1 to represent the difference. Similarly, the difference between 01 and 10 is 1, so the value is encoded as 1 to represent the difference.
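The differential encoding of the binary values may be sketched as follows. The sketch treats each set of binary values as an unsigned integer written as a fixed-width bit string. The function names and the string representation are assumptions made for illustration.

```python
from itertools import accumulate


def encode_block_values(blocks):
    """Keep the first block as-is; store later blocks as integer increases.

    `blocks` is a non-empty list of equal-width bit strings sorted from
    lowest to highest, such as ["00", "01", "11"].
    """
    values = [int(block, 2) for block in blocks]
    deltas = [current - previous for previous, current in zip(values, values[1:])]
    return blocks[0], deltas


def decode_block_values(first, deltas):
    """Rebuild the bit strings from the first block and the increases."""
    width = len(first)
    values = accumulate(deltas, initial=int(first, 2))
    return [format(value, f"0{width}b") for value in values]


print(encode_block_values(["00", "00", "01", "10", "11"]))
print(encode_block_values(["00", "01", "11"]))
print(encode_block_values(["00", "01", "10"]))
print(decode_block_values("00", [1, 2]))
```

The three encodings correspond to the sequences 00, 0, 1, 1, 1; 00, 1, 2; and 00, 1, 1 given in the text, and the final call shows that the original sequence 00, 01, 11 can be recovered from its encoded form.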

FIG. 5 shows an example block diagram of the data manager 114, according to some example embodiments. To avoid obscuring the inventive subject matter with unnecessary detail, various functional components (e.g., modules) that are not germane to conveying an understanding of the inventive subject matter have been omitted from FIG. 5. However, a skilled artisan will readily recognize that various additional functional components may be supported by data manager 114 to facilitate additional functionality that is not specifically described herein. Furthermore, the various functional modules depicted in FIG. 5 may reside on a single computing device or may be distributed across several computing devices in various arrangements such as those used in cloud-based architectures.

As shown, the data manager 114 includes an interface module 502, a vector conversion module 504, a binary conversion module 506, a vector block generation module 508, a block index generation module 510, a vector block grouping module 512, an encoding module 514, a candidate determination module 516, and a distance determination module 518.

The interface module 502 provides a user with a data management interface that enables the user to submit a search query based on a data object. The data management interface provides user interface elements, such as text boxes, buttons, check boxes, etc., that allow the user to submit a data object such as an image, file, audio file, etc. The submitted data object is a target data object that the data manager 114 uses as a basis to search for other similar data objects. The data management interface presents the user with a listing of data objects returned in response to the user's search query.

The vector conversion module 504 generates a vector that represents a data object. The resulting vector includes multiple dimensions that each represent a feature of the data object. Each dimension includes a floating value that represents the corresponding feature.

The binary conversion module 506 converts the floating values included in a vector to binary values. The binary conversion module 506 utilizes a threshold value to convert each floating value to either a 1 or a 0. For example, the binary conversion module 506 converts the floating values that are below the threshold value to 0 and converts the floating values that meet or exceed the threshold value to 1.
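As a minimal sketch of the thresholding described above, the conversion can be written in Python as follows. The function name binarize and the default threshold of 0.5 are assumptions chosen for illustration; the disclosure does not specify a particular threshold value.

```python
def binarize(vector, threshold=0.5):
    """Map each floating value to 1 if it meets or exceeds the threshold, else 0.

    The default threshold of 0.5 is an assumption for this sketch.
    """
    return [1 if value >= threshold else 0 for value in vector]


print(binarize([0.1, 0.5, 0.9, 0.49]))
# prints [0, 1, 1, 0]
```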

The vector block generation module 508 generates a set of sequentially ordered vector blocks for a vector. Each vector block in a set of sequentially ordered vector blocks contains a set of two or more binary values that are ordered sequentially in the vector, as well as a numerical vector identifier identifying the vector.
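One possible way to split a binary vector into sequentially ordered vector blocks is sketched below in Python. The function name make_vector_blocks, the tuple layout (block, vector identifier), and the default block size of two are assumptions made for this sketch.

```python
def make_vector_blocks(vector_id, bits, block_size=2):
    """Split a binary vector into consecutive blocks tagged with the vector id.

    Returns a list of (block, vector_id) pairs, one per sequential position.
    A block size of two is an assumption; any size of two or more would fit
    the description.
    """
    return [
        (tuple(bits[start:start + block_size]), vector_id)
        for start in range(0, len(bits), block_size)
    ]


print(make_vector_blocks(4, [0, 0, 0, 1, 0, 0, 1, 1]))
# prints [((0, 0), 4), ((0, 1), 4), ((0, 0), 4), ((1, 1), 4)]
```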

The block index generation module 510 generates a block index based on the sets of sequentially ordered vector blocks. The block index includes a set of vector block arrays where each individual vector block array corresponds to one sequential position in the vector. Each of the vector block arrays includes one vector block from each set of sequentially ordered vector blocks that correspond to the same sequential position. As an example, the first vector block array includes vector blocks from the first sequential position (i.e., including the first and second sequentially ordered binary values in the vector). The second vector block array includes the vector blocks from the second sequential position (i.e., including the third and fourth sequentially ordered binary values in the vector) and so forth. The vector blocks in each vector block array are ordered sequentially based on their binary values, such as from lowest to highest or vice versa. Vector blocks with the same binary values are ordered sequentially based on the vector identifier values.
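The construction of the block index can be sketched in Python as follows. The function name build_block_index, the dictionary input mapping vector identifiers to bit lists, and the list-of-lists layout of the index are assumptions for illustration only.

```python
def build_block_index(vectors, block_size=2):
    """Build one sorted block array per sequential block position.

    `vectors` maps a numerical vector id to a list of bits; all vectors are
    assumed to have the same length. Each block array holds (block, id)
    pairs sorted by block value and then by vector id.
    """
    if not vectors:
        return []
    length = len(next(iter(vectors.values())))
    num_positions = (length + block_size - 1) // block_size
    index = [[] for _ in range(num_positions)]
    for vector_id, bits in vectors.items():
        for pos in range(num_positions):
            block = tuple(bits[pos * block_size:(pos + 1) * block_size])
            index[pos].append((block, vector_id))
    for block_array in index:
        # Tuples of bits of equal length sort in the same order as their
        # binary values; ties are broken by the vector id.
        block_array.sort()
    return index


example = {1: [0, 1, 1, 0], 2: [0, 0, 0, 1], 3: [0, 1, 0, 0]}
for block_array in build_block_index(example):
    print(block_array)
# prints [((0, 0), 2), ((0, 1), 1), ((0, 1), 3)]
# prints [((0, 0), 3), ((0, 1), 2), ((1, 0), 1)]
```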

The vector block grouping module 512 groups vector blocks in each vector block array that include the same binary values. Each resulting combined vector block includes a single set of the binary values as well as the vector identifiers of each of the vector blocks that were combined. The vector identifiers are ordered sequentially from lowest to highest.
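The grouping step can be illustrated with the following Python sketch, which assumes a block array that is already sorted by block value and then by vector identifier. The function name group_block_array and the (block, identifier list) output layout are hypothetical.

```python
from itertools import groupby


def group_block_array(block_array):
    """Merge sorted (block, vector_id) entries that share the same block.

    Returns a list of (block, [vector ids]) pairs. Because the input is
    sorted by block and then by id, each id list is already in ascending
    order.
    """
    return [
        (block, [vector_id for _, vector_id in entries])
        for block, entries in groupby(block_array, key=lambda entry: entry[0])
    ]


print(group_block_array([((0, 0), 2), ((0, 1), 1), ((0, 1), 3)]))
# prints [((0, 0), [2]), ((0, 1), [1, 3])]
```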

The encoding module 514 encodes the vector identifiers in the combined vector blocks as well as the binary values in each vector block. The encoding module 514 utilizes differential encoding, in which a value is encoded based on the difference between the value and the previous sequentially ordered value. For example, vector identifiers for vectors V5 and V9 would be encoded as 5, 4. The first value of 5 represents the value of the first vector identifier (i.e., V5) and the second value of 4 represents the difference between the second vector identifier (i.e., V9) and the first vector identifier (i.e., V5). The encoding module 514 similarly encodes the binary values included in each vector block. For example, the sequence 00, 01, 01, 11 is encoded as 00, 1, 0, 2. The initial 00 represents the first value, the second value of 1 represents the difference between 01 and 00, the third value of 0 represents that there is no change between the second and third values, and the fourth value of 2 represents the difference between 11 and 01.
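The differential encoding of vector identifiers, together with the corresponding decoding, can be sketched in Python as follows. The function names delta_encode_ids and delta_decode_ids are hypothetical, and the sketch assumes the identifiers are supplied as an ascending list of integers.

```python
from itertools import accumulate


def delta_encode_ids(ids):
    """Keep the first id and replace each later id by its increase over the previous id."""
    return [cur - prev for prev, cur in zip([0] + ids, ids)]


def delta_decode_ids(deltas):
    """Recover the original ids by taking the running sum of the deltas."""
    return list(accumulate(deltas))


print(delta_encode_ids([5, 9]))
# prints [5, 4]
print(delta_decode_ids([5, 4]))
# prints [5, 9]
```

Decoding is a running sum, so a reader of the index only needs to scan a combined vector block from its start to recover every identifier.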

The candidate determination module 516 uses the block index to determine a set of candidate vectors that may be similar to a target vector. The candidate determination module 516 identifies the set of candidate vectors by determining a set of vectors that have a Hamming distance from the target vector that is below a threshold Hamming distance. The Hamming distance indicates the number of positions at which two binary strings of equal length differ.
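For reference, the Hamming distance between two binary vectors can be computed as in the following Python sketch. The function name hamming_distance is chosen for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance([0, 0, 0, 1, 0, 0, 1, 1], [0, 0, 1, 1, 0, 0, 1, 0]))
# prints 2
```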

To increase efficiency in identifying the vectors with a Hamming distance below the threshold distance, the candidate determination module 516 performs a binary search of each block array based on the binary values of the target vector that are in the corresponding position. For example, given a target vector of 0, 0, 0, 1, 0, 0, 1, 1, the candidate determination module 516 performs a binary search of the first block array, which corresponds to position B1, based on the binary values in the target vector that are in position B1 (i.e., 0, 0). Likewise, the candidate determination module 516 performs a binary search of the second block array, which corresponds to position B2, based on the binary values in the target vector that are in position B2 (i.e., 0, 1).

The candidate determination module 516 performs the binary search to identify matching vector blocks and adds the vectors identified by the matching vector blocks to the candidate list. For example, if a binary search of the first block array results in a match of a vector V4 that includes binary values 0, 0, the candidate determination module 516 adds the vector V4 to the candidate list. The candidate determination module 516 repeats this process for each block array and adds and removes vectors from the candidate list according to desired implementations. For example, the vectors are added or removed based on the desired threshold Hamming distance.
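The description leaves the exact rule for adding and removing candidates open. The following Python sketch shows one possible rule, offered as an assumption rather than as the disclosed method: every block that fails to match contributes at least one differing bit, so a vector whose count of non-matching blocks is not below the threshold cannot have a Hamming distance below the threshold and is dropped. The function name find_candidates, the index layout (one sorted list of (block, vector id) pairs per position), and the requirement that the threshold not exceed the number of positions are all assumptions of this sketch.

```python
from bisect import bisect_left
from collections import Counter


def find_candidates(block_index, target_bits, threshold, block_size=2):
    """Return ids of vectors that may lie within the Hamming threshold.

    Each block array is binary searched for the target's block at that
    position. A vector is kept when its number of non-matching blocks is
    below the threshold (a necessary, not sufficient, condition).
    """
    num_positions = len(block_index)
    if threshold > num_positions:
        # With a larger threshold, a vector matching no block could still
        # qualify, and this sketch would miss it.
        raise ValueError("threshold must not exceed the number of positions")
    matches = Counter()
    for pos, block_array in enumerate(block_index):
        block = tuple(target_bits[pos * block_size:(pos + 1) * block_size])
        # (block,) sorts just before every (block, vector_id) entry.
        i = bisect_left(block_array, (block,))
        while i < len(block_array) and block_array[i][0] == block:
            matches[block_array[i][1]] += 1
            i += 1
    return sorted(
        vector_id
        for vector_id, count in matches.items()
        if num_positions - count < threshold
    )


index = [
    [((0, 0), 2), ((0, 1), 1), ((0, 1), 3)],
    [((0, 0), 3), ((0, 1), 2), ((1, 0), 1)],
]
print(find_candidates(index, [0, 1, 0, 0], threshold=2))
# prints [1, 3]
```

The surviving candidates would still need an exact distance check, since matching blocks only bound the Hamming distance from below.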

The distance determination module 518 determines the distance between a target vector and the set of candidate vectors. The distance determination module 518 uses distance functions to identify vectors representing data objects that are close to the target vector representing a target data object. Nearby vectors indicate that the corresponding data objects match or are similar to the target data object. The distance determination module 518 uses the original versions of the candidate vectors and the target vector, which include floating values, to determine the nearest vectors. The vectors identified as being nearest to the target vector are presented to the user as search results.
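The final ranking over the original floating value vectors can be sketched in Python as follows. The disclosure does not name a specific distance function; Euclidean distance is used here as an assumption, and the function name nearest_vectors and the parameter k are hypothetical.

```python
import math


def nearest_vectors(target, candidates, k=3):
    """Rank candidate vectors by Euclidean distance to the target.

    `candidates` maps a vector id to its original floating value vector.
    Returns the ids of the k nearest candidates, nearest first.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: math.dist(target, item[1]),
    )
    return [vector_id for vector_id, _ in ranked[:k]]


candidates = {1: [1.0, 1.0], 2: [0.1, 0.0], 3: [0.5, 0.5]}
print(nearest_vectors([0.0, 0.0], candidates, k=2))
# prints [2, 3]
```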

FIG. 6 shows an example method for smaller proximate search, according to some example embodiments. The method 600 may be embodied in computer readable instructions for execution by one or more processors such that the operations of the method 600 may be performed in part or in whole by the data manager 114; accordingly, the method 600 is described below by way of example with reference thereto. However, it shall be appreciated that at least some of the operations of the method 600 may be deployed on various other hardware configurations and the method 600 is not intended to be limited to the data manager 114.

At operation 602, the data manager 114 accesses a set of vectors. Each vector contains binary values and is assigned a numerical vector identifier. Each vector represents a data object. For example, the vector conversion module 504 generates the set of vectors to represent a set of data objects. Each resulting vector includes floating values representing features of the corresponding data object. The binary conversion module 506 converts the floating values to binary values, resulting in the set of vectors.

At operation 604, the vector block generation module 508 generates a set of sequentially ordered vector blocks for each vector. Each vector block in a set of sequentially ordered vector blocks contains a set of two or more binary values ordered sequentially in the corresponding vector and the numerical vector identifier for the vector.

At operation 606, the block index generation module 510 generates a block index based on each corresponding set of sequentially ordered vector blocks. The block index includes a set of vector block arrays. Each vector block array corresponds to a respective sequential position and includes one vector block from each of the corresponding sets of sequentially ordered vector blocks that are in the respective sequential position. The vector blocks in each vector block array are ordered sequentially based on the two or more sequential binary values in each respective vector block.

At operation 608, the vector block grouping module 512 combines pairs of sequentially ordered vector blocks that contain matching sets of two or more binary values into combined vector blocks. Each combined vector block contains the respective matching set of two or more binary values and the numerical vector identifiers assigned to the respective pair of sequentially ordered vector blocks that were combined to form the combined vector block. The numerical vector identifiers included in each combined vector block are ordered sequentially from lowest to highest.

To further reduce the size of the block index, the encoding module 514 applies differential encoding to the numerical vector identifiers assigned to each pair of sequentially ordered vector blocks that were combined to form the combined vector block. Further, the encoding module 514 applies differential encoding based on the respective sets of two or more sequential binary values included in each vector block in the vector block array.
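Operation 608 and the subsequent encoding can be combined into a single pass over a sorted vector block array, as in the following Python sketch. The function name encode_block_array, the use of bit strings for blocks, and the output layout of (block code, identifier deltas) pairs are assumptions made for illustration.

```python
from itertools import groupby


def encode_block_array(block_array):
    """Group equal blocks, then delta-encode block values and vector ids.

    `block_array` is a list of (bit_string, vector_id) pairs sorted by
    block value and then by id. The first block is stored verbatim; each
    later block is stored as its increase over the previous block. Within
    a combined block, the first id is stored verbatim and later ids as
    increases over the previous id.
    """
    encoded = []
    prev_value = None
    for block, entries in groupby(block_array, key=lambda entry: entry[0]):
        ids = [vector_id for _, vector_id in entries]
        id_deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        value = int(block, 2)
        block_code = block if prev_value is None else value - prev_value
        encoded.append((block_code, id_deltas))
        prev_value = value
    return encoded


print(encode_block_array([("00", 2), ("01", 1), ("01", 3), ("11", 5), ("11", 9)]))
# prints [('00', [2]), (1, [1, 2]), (2, [5, 4])]
```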

FIG. 7 shows a block diagram illustrating components of a computing device 700, according to some example embodiments, able to read instructions from a machine-readable medium (e.g., a machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically, FIG. 7 shows a diagrammatic representation of a computing device 700 in the example form of a system, within which instructions 702 (e.g., software, a program, an application, an applet, an app, a driver, or other executable code) for causing the computing device 700 to perform any one or more of the methodologies discussed herein may be executed. For example, the instructions 702 include executable code that causes the computing device 700 to execute the method 600. In this way, these instructions transform the general, non-programmed machine into a particular machine programmed to carry out the described and illustrated functions in the manner described herein. The computing device 700 may operate as a standalone device or may be coupled (e.g., networked) to other machines.

By way of non-limiting example, the computing device 700 may comprise or correspond to a television, a computer (e.g., a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, or a netbook), a set-top box (STB), a personal digital assistant (PDA), an entertainment media system (e.g., an audio/video receiver), a cellular telephone, a smart phone, a mobile device, a wearable device (e.g., a smart watch), a portable media player, or any machine capable of outputting audio signals and capable of executing the instructions 702, sequentially or otherwise, that specify actions to be taken by the computing device 700. Further, while only a single computing device 700 is illustrated, the term “machine” shall also be taken to include a collection of computing devices 700 that individually or jointly execute the instructions 702 to perform any one or more of the methodologies discussed herein.

The computing device 700 may include processors 704, a memory 706, a storage unit 708 and I/O components 710, which may be configured to communicate with each other such as via a bus 712. In an example embodiment, the processors 704 (e.g., a central processing unit (CPU), a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a radio-frequency integrated circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, both processors 714 and 716 that may execute the instructions 702. The term “processor” is intended to include multi-core processors that may comprise two or more independent processors (sometimes referred to as “cores”) that may execute instructions contemporaneously. Although FIG. 7 shows multiple processors, the computing device 700 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.

The memory 706 (e.g., a main memory or other memory storage) and the storage unit 708 are both accessible to the processors 704 such as via the bus 712. The memory 706 and the storage unit 708 store instructions 702 embodying any one or more of the methodologies or functions described herein. In some embodiments, a database 716 resides on the storage unit 708. The instructions 702 may also reside, completely or partially, within the memory 706, within the storage unit 708, within at least one of the processors 704 (e.g., within the processor's cache memory), or any suitable combination thereof, during execution thereof by the computing device 700. Accordingly, the memory 706, the storage unit 708, and the memory of the processors 704 are examples of machine-readable media.

As used herein, “machine-readable medium” means a device able to store instructions and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., electrically erasable programmable read-only memory (EEPROM)), or any suitable combination thereof. The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store the instructions 702. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions (e.g., the instructions 702) for execution by a machine (e.g., the computing device 700), such that the instructions, when executed by one or more processors of the computing device 700 (e.g., processors 704), cause the computing device 700 to perform any one or more of the methodologies described herein (e.g., method 600). Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.

Furthermore, the “machine-readable medium” is non-transitory in that it does not embody a propagating signal. However, labeling the tangible machine-readable medium as “non-transitory” should not be construed to mean that the medium is incapable of movement; the medium should be considered as being transportable from one real-world location to another. Additionally, since the machine-readable medium is tangible, the medium may be considered to be a machine-readable device.

The I/O components 710 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 710 that are included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones will likely include a touch input device or other such input mechanisms, while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 710 may include many other components that are not specifically shown in FIG. 7. The I/O components 710 are grouped according to functionality merely for simplifying the following discussion and the grouping is in no way limiting. In various example embodiments, the I/O components 710 may include input components 718 and output components 720. Input components 718 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or other pointing instrument), tactile input components (e.g., a physical button, a touch screen that provides location and/or force of touches or touch gestures, or other tactile input components), audio input components, and the like. Output components 720 may include visual components (e.g., a display such as a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibratory motor, resistance mechanisms), other signal generators, and so forth.

Communication may be implemented using a wide variety of technologies. I/O components 710 may include communication components 722 operable to couple the computing device 700 to the network 724 or the devices 726 via a coupling 728 and a coupling 730, respectively. For example, the communication components 722 may include a network interface component or other suitable device to interface with the network 724. In further examples, the communication components 722 may include wired communication components, wireless communication components, cellular communication components, near field communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components to provide communication via other modalities. The devices 726 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a Universal Serial Bus (USB)).

Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more computer systems (e.g., a standalone, client, or server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In various embodiments, a hardware module may be implemented mechanically or electronically. For example, a hardware module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

Accordingly, the term “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired) or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where the hardware modules comprise a general-purpose processor configured using software, the general-purpose processor may be configured as respective different hardware modules at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.

Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple of such hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses that connect the hardware modules). In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).

The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example embodiments, comprise processor-implemented modules.

Similarly, the methods described herein may be at least partially processor-implemented. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented modules. The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the processor or processors may be located in a single location (e.g., within a home environment, an office environment, or a server farm), while in other embodiments the processors may be distributed across a number of locations.

The one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., APIs).

Electronic Apparatus and System

Example embodiments may be implemented in digital electronic circuitry, or in computer hardware, firmware, or software, or in combinations of them. Example embodiments may be implemented using a computer program product, for example, a computer program tangibly embodied in an information carrier, for example, in a machine-readable medium for execution by, or to control the operation of, data processing apparatus, for example, a programmable processor, a computer, or multiple computers.

A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a standalone program or as a module, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

In example embodiments, operations may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method operations can also be performed by, and apparatus of example embodiments may be implemented as, special purpose logic circuitry (e.g., an FPGA or an ASIC).

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In embodiments deploying a programmable computing system, it will be appreciated that both hardware and software architectures merit consideration. Specifically, it will be appreciated that the choice of whether to implement certain functionality in permanently configured hardware (e.g., an ASIC), in temporarily configured hardware (e.g., a combination of software and a programmable processor), or in a combination of permanently and temporarily configured hardware may be a design choice. Below are set out hardware (e.g., machine) and software architectures that may be deployed, in various example embodiments.

Language

Although the embodiments of the present invention have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader scope of the inventive subject matter. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof show by way of illustration, and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

Such embodiments of the inventive subject matter may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Thus, although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

All publications, patents, and patent documents referred to in this document are incorporated by reference herein in their entirety, as though individually incorporated by reference. In the event of inconsistent usages between this document and those documents so incorporated by reference, the usage in the incorporated references should be considered supplementary to that of this document; for irreconcilable inconsistencies, the usage in this document controls.

In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of “at least one” or “one or more.” In this document, the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.” Also, in the following claims, the terms “including” and “comprising” are open-ended; that is, a system, device, article, or process that includes elements in addition to those listed after such a term in a claim is still deemed to fall within the scope of that claim.

The invention claimed is:
 1. A computer-implemented method comprising: accessing a set of vectors, each vector comprising binary values; for each vector from the set of vectors, generating a set of vector blocks with each vector block corresponding with a position in the vector and each of at least a portion of the vector blocks comprising two or more of the binary values from the vector; and generating, using one or more processors, a block index having a set of block arrays with each vector block array identifying, for a corresponding position in the vectors, binary values of the vector block from the corresponding position for each vector.
 2. The computer-implemented method of claim 1, wherein accessing the set of vectors comprises: accessing a set of floating value vectors; and converting floating values in the floating value vectors to binary values to provide the set of vectors.
 3. The computer-implemented method of claim 1, wherein generating the set of vector blocks for each vector comprises assigning a vector identifier for the vector to each vector block for the vector.
 4. The computer-implemented method of claim 1, wherein generating the block index comprises: for a first vector block array corresponding with a first position in the vectors, storing a vector block from the first corresponding position for each vector from the set of vectors.
 5. The computer-implemented method of claim 4, wherein generating the block index further comprises: ordering the vector blocks in the first vector block array based on binary values of the vector blocks.
 6. The computer-implemented method of claim 5, wherein generating the block index further comprises: encoding binary values of a first vector block in the first vector block array by representing the binary values of the first vector block based on a difference between the binary values of the first vector block and the binary values of a previous vector block in the first vector block array.
 7. The computer-implemented method of claim 4, wherein generating the block index further comprises: consolidating one or more vector blocks in the first vector block array that include same binary values.
 8. The computer-implemented method of claim 7, wherein generating the block index further comprises: encoding vector identifiers of vectors for each consolidated vector block by representing at least one vector identifier based on a change from a previous vector identifier in the consolidated vector block.
 9. The computer-implemented method of claim 1, wherein the method further comprises: receiving a target vector comprising binary values; identifying, using the block index, a set of candidate vectors based on a Hamming distance between the target vector and each candidate vector from the set of candidate vectors; and determining, from the set of candidate vectors, a set of one or more result vectors.
 10. The computer-implemented method of claim 1, wherein each vector corresponds to a data object represented by a respective vector.
 11. A non-transitory computer-readable medium storing instructions that, when executed by one or more computer processors of a computing system, cause the computing system to perform operations comprising: receiving a target vector comprising binary values; accessing a block index storing data for a set of vectors, each vector from the set of vectors comprising binary values, the block index comprising a set of vector block arrays with each vector block array being associated with a corresponding position in the vectors and identifying binary values of a vector block from the corresponding position for each vector, each of at least a portion of the vector blocks comprising two or more binary values from each vector; and identifying, using the block index, a set of candidate vectors based on a distance between the target vector and each candidate vector from the set of candidate vectors.
 12. The non-transitory computer-readable medium of claim 11, wherein receiving the target vector comprises: receiving a data object; converting the data object to a floating value vector; and converting floating values in the floating value vector to binary values to provide the target vector.
 13. The non-transitory computer-readable medium of claim 11, wherein identifying the set of candidate vectors comprises: generating a set of vector blocks for the target vector based on the corresponding positions associated with the vector block arrays in the block index; and searching the block index by comparing binary values of each vector block from the target vector to binary values for the set of vectors identified by each vector block array.
 14. The non-transitory computer-readable medium of claim 11, wherein the distance between the target vector and each candidate vector comprises a hamming distance.
 15. The non-transitory computer-readable medium of claim 14, wherein identifying the set of candidate vectors comprises: determining a hamming distance for each vector from the set of vectors; and comparing the hamming distance for each vector from the set of vectors to a distance threshold.
 16. The non-transitory computer-readable medium of claim 11, wherein the operations further comprise: selecting, from the set of candidate vectors, one or more result vectors.
 17. The non-transitory computer-readable medium of claim 16, wherein selecting the one or more result vectors comprises: determining a distance between a floating value vector for the target vector and a floating value vector for each candidate vector; and selecting the one or more result vectors based on the determined distances.
 18. A system comprising: one or more processors; and one or more computer-readable media storing instructions that, when used by the one or more processors, cause the one or more processors to: generate a block index from a set of vectors comprising binary values by generating a set of vector blocks for each vector from the set of vectors and indexing data in the block index based on the vector blocks, the block index comprising a set of vector block arrays with each vector block array being associated with a corresponding position in the vectors and identifying binary values of a vector block from the corresponding position for each vector, each of at least a portion of the vector blocks comprising two or more binary values from each vector; receive a target vector comprising binary values; generate target vector blocks from the target vector; identify one or more candidate vectors by searching the vector block arrays of the block index using binary values of each target vector block; and select one or more result vectors from the candidate vectors.
 19. The system of claim 18, wherein identifying the one or more candidate vectors comprises: computing a hamming distance between the target vector and each vector; and comparing the hamming distance for each vector to a threshold.
 20. The system of claim 18, wherein selecting the one or more result vectors comprises: determining a distance between a floating value vector for the target vector and a floating value vector for each candidate vector; and selecting the one or more result vectors based on the distances.
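
For illustration only, the steps recited in the claims above can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names (binarize, split_blocks, build_block_index, delta_encode, search, rerank), the fixed block size, the zero threshold used to binarize floating values, the exact-match lookup of each target block, and the Euclidean distance used for re-ranking are all hypothetical choices introduced here. The exact-match lookup relies on a pigeonhole argument: if a vector is split into more blocks than the allowed hamming distance, any vector within that distance must agree with the target on at least one whole block.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

Bits = Tuple[int, ...]

def binarize(values: Sequence[float], threshold: float = 0.0) -> Bits:
    # Assumed conversion rule: a floating value above the threshold becomes 1.
    return tuple(1 if v > threshold else 0 for v in values)

def split_blocks(bits: Sequence[int], block_size: int) -> List[Bits]:
    # One vector block per position; each block holds up to block_size binary values.
    return [tuple(bits[i:i + block_size]) for i in range(0, len(bits), block_size)]

def build_block_index(vectors: Dict[int, Bits], block_size: int) -> List[Dict[Bits, List[int]]]:
    # Assumes a non-empty set of equal-length vectors keyed by vector identifier.
    n_positions = len(split_blocks(next(iter(vectors.values())), block_size))
    arrays = [defaultdict(list) for _ in range(n_positions)]
    for vid in sorted(vectors):
        for pos, block in enumerate(split_blocks(vectors[vid], block_size)):
            # Blocks with the same binary values are consolidated under one key.
            arrays[pos][block].append(vid)
    # Each vector block array is ordered by the binary values of its blocks.
    return [dict(sorted(arr.items())) for arr in arrays]

def delta_encode(ids: Sequence[int]) -> List[int]:
    # Store the first identifier, then each change from the previous identifier.
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(x != y for x, y in zip(a, b))

def search(index, vectors: Dict[int, Bits], target: Bits,
           block_size: int, max_distance: int) -> List[int]:
    # Candidate step (assumed): any vector sharing at least one exact block.
    candidates = set()
    for pos, block in enumerate(split_blocks(target, block_size)):
        candidates.update(index[pos].get(block, []))
    # Filter step: compare the full hamming distance to the threshold.
    return sorted(v for v in candidates if hamming(vectors[v], target) <= max_distance)

def rerank(candidates: Sequence[int], float_vectors: Dict[int, Sequence[float]],
           target_float: Sequence[float], k: int) -> List[int]:
    # Result selection by distance between floating value vectors (Euclidean assumed).
    return sorted(candidates, key=lambda v: dist(float_vectors[v], target_float))[:k]
```

In this sketch, consolidation and ordering happen while the index is built, and delta_encode shows how a consolidated, sorted identifier list could be stored as changes from the previous identifier. A real index would likely pack blocks into integers rather than tuples; tuples are used here only to keep the example readable.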